Discovering functionally important sites in proteins

Proteins play important roles in biology, biotechnology and pharmacology, and missense variants are a common cause of disease. Discovering functionally important sites in proteins is a central but difficult problem because large, systematic data sets are lacking. Sequence conservation can highlight functionally important residues, but this signal is often confounded with selection for preserving structural stability. Here we present a machine learning method that predicts functional sites by combining statistical models for protein sequences with biophysical models of stability. We train the model using multiplexed experimental data on variant effects and validate it broadly. We show how the model can be used to discover active sites, as well as regulatory and binding sites. We illustrate the utility of the model by prospectively predicting, and subsequently validating by experiment, the functional consequences of missense variants in HPRT1, which may cause Lesch-Nyhan syndrome, and we pinpoint the molecular mechanisms by which they cause disease.
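The idea of separating a conservation signal from a stability signal can be illustrated with a simple two-cutoff rule, similar in spirit to the cutoff-based baseline mentioned in the supplementary benchmarks. The sketch below is not the trained model. The function name, the cutoff values, the sign conventions (lower evolutionary score means less tolerated, higher ΔΔG means more destabilizing) and the mapping of quadrants to the four variant classes named in the figure captions are all assumptions made for illustration.

```python
# Illustrative two-cutoff rule; NOT the machine learning model of the paper.
# Assumed conventions (hypothetical):
#   evol_score: lower values mean the substitution is less tolerated in evolution
#   ddg:        higher values (kcal/mol) mean the substitution is more destabilizing
# The default cutoffs are placeholders, not values taken from the paper.

def classify_variant(evol_score, ddg, evol_cutoff=-3.0, ddg_cutoff=2.0):
    """Assign one of four variant classes from two scores and two cutoffs."""
    not_tolerated = evol_score < evol_cutoff
    destabilizing = ddg > ddg_cutoff
    if not_tolerated and destabilizing:
        return "total loss"
    if not_tolerated:
        # Conserved in evolution but not explained by loss of stability
        return "SBI"
    if destabilizing:
        return "low abundance, high activity"
    return "WT-like"


if __name__ == "__main__":
    for evol, ddg in [(-5.0, 0.3), (-5.0, 4.0), (-0.5, 0.1), (-0.5, 3.5)]:
        print(evol, ddg, classify_variant(evol, ddg))
```

Variants that land in the SBI quadrant are the ones where conservation cannot be attributed to stability alone, which is the signal a functional-site predictor is looking for.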


FIGURES
Comparing values of the individual features for SBI and non-SBI variants. Using data from the training set, consisting of 9945 individual variants, each raincloud plot shows, for one feature used in the model, the distribution of feature values in the training proteins for variants belonging to the SBI class (as assigned by experiments) or to one of the other three classes (WT-like, total loss and 'low abundance, high activity'). Each plot also summarizes the data with a boxplot, where the central line represents the median, the boundaries of the box represent the first quartile (Q1; bottom) and the third quartile (Q3; top), and the whisker boundaries lie 1.5 times the inter-quartile range (IQR = Q3 − Q1) beyond the nearest quartile. A black line connecting the two medians shows how they compare. Raw data are also displayed as points under the box plot.
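The boxplot quantities described in this caption can be computed as in the sketch below. It uses only the Python standard library; the function name is hypothetical, and the "inclusive" quartile method is an assumption, since the caption does not say which quartile convention was used.

```python
import statistics

def boxplot_stats(values):
    """Return quartiles, IQR and 1.5*IQR whisker boundaries for a sample."""
    # Quartile convention is an assumption; plotting libraries differ slightly.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": q1 - 1.5 * iqr,  # 1.5 IQR below Q1
        "upper_whisker": q3 + 1.5 * iqr,  # 1.5 IQR above Q3
    }


if __name__ == "__main__":
    print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Many plotting libraries draw the whisker at the most extreme data point that still lies inside these boundaries, so a drawn whisker can be shorter than the computed limit.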

[Figure panels. (A) GRB2-SH3: bars grouped as "Functional residue" and "Overall", comparing "Evolution + Rosetta" with "Functional model". (B) Models in the legend, by feature set: GEMME(v)+Rosetta(v); GEMME(v+r+e); Rosetta(v+r+e); GEMME(v+r+e)+wcn(r); Rosetta(v+r+e)+GEMME(v+r+e); Rosetta(v+r+e)+GEMME(v+r+e)+wcn(r)+hydrophobicity_mutation(v); Rosetta(v+r+e)+GEMME(v+r+e)+wcn(r)+hydrophobicity(r); Rosetta(v+r+e)+GEMME(v+r+e)+wcn(r)+hydrophobicity_mutation(v)+hydrophobicity_WT(r)+SASA(r)+WT_AA(r).]

Supplementary Figure 3. Benchmarks of our functional sites model. (A) Comparison of the predictions of residue classes in the GRB2 SH3 domain using either our optimized model (green) or simply using cutoff values for evolutionary conservation and thermodynamic stability changes (yellow). The three leftmost bars show the results for the subset of functional residues, while the three rightmost series report the results for all the positions predicted. In both cases precision, recall and F1-score are used as metrics. (B) Comparison between results from our vanilla model (brown) and vanilla models trained using other sets of features. Results from our final version with optimized hyperparameters are also reported (green). The F1 score is shown on the y-axis and the stars highlight the best model for each set (excluding the fully optimized model). The leftmost set reports the results on the test set (with 1989 variants tested) obtained from a 5-fold cross-validation procedure, while the other sets show the scores for the GRB2-SH3 domain, both for the functional residue subset (64 variants and 5 residues) and for all the residues (1053 variants and 56 residues). The legend reports which features were used, with v, r and e representing variants, residues and environment, respectively (see Methods for the list of features).

(C) Shows the comparison for buried residues (with an exposed surface area of less than 20%) and (D) shows exposed residues. In both C and D square markers represent positions where the WT residue is hydrophobic and circles indicate non-hydrophobic WT residues.
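The precision, recall and F1-score used in these benchmarks can be computed for a single class as sketched below. The function name and class labels are hypothetical, and the sketch treats one class (for example the functional residues) as the positive class; how scores were aggregated over all classes for the "all positions" comparison is not stated here, so no averaging scheme is assumed.

```python
def precision_recall_f1(true_labels, predicted_labels, positive_class):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_class and p == positive_class for t, p in pairs)
    fp = sum(t != positive_class and p == positive_class for t, p in pairs)
    fn = sum(t == positive_class and p != positive_class for t, p in pairs)
    # Guard against empty denominators (no predictions or no true positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels for five residues
    truth = ["functional", "functional", "wt_like", "wt_like", "total_loss"]
    preds = ["functional", "wt_like", "functional", "wt_like", "total_loss"]
    print(precision_recall_f1(truth, preds, "functional"))
```

Because the functional subset is small (5 residues in GRB2-SH3), a single misclassified residue shifts these scores considerably, which is worth keeping in mind when comparing bars.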
(E) Comparison of experimental k_cat/K_M,cMUP values and temperature effects during expression. In particular, the T-effect value represents the change in measured catalytic efficiency when the protein was expressed at 23 °C or 37 °C (but with the enzymatic assay performed at 23 °C in both cases). Variants that show a substantial (greater than 10-fold) change nearly all belong to the total-loss category. Points with larger markers represent data for which the T-effect might be underestimated due to experimental limitations.

Supplementary Figure 6. Prediction of active site residues for the set of ten enzymes from [43]. The figure shows, for each enzyme in the dataset, the total number of reported active site positions (white), the subset of these predicted as being functional residues by our model (blue) and the positions predicted to be total loss (red). The rightmost bar shows the cumulative data.
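The per-enzyme and cumulative counts shown in Supplementary Figure 6 amount to a simple tally, sketched below. The function name, the data layout (dictionaries keyed by enzyme and position), the class labels and the example values are all hypothetical and do not come from the dataset in [43].

```python
def tally_active_sites(active_sites, predictions):
    """Count active-site positions per enzyme and how the model classified them.

    active_sites: {enzyme: set of active-site positions}
    predictions:  {enzyme: {position: predicted class}}
    """
    keys = ("total_sites", "functional", "total_loss")
    per_enzyme = {}
    cumulative = dict.fromkeys(keys, 0)
    for enzyme, positions in active_sites.items():
        classes = predictions.get(enzyme, {})
        counts = {
            "total_sites": len(positions),
            "functional": sum(classes.get(p) == "functional" for p in positions),
            "total_loss": sum(classes.get(p) == "total_loss" for p in positions),
        }
        per_enzyme[enzyme] = counts
        for key in keys:
            cumulative[key] += counts[key]  # feeds the cumulative bar
    return per_enzyme, cumulative


if __name__ == "__main__":
    # Made-up example with two enzymes
    sites = {"enzyme_A": {10, 45, 80}, "enzyme_B": {5, 6}}
    preds = {
        "enzyme_A": {10: "functional", 45: "total_loss", 80: "functional", 99: "wt_like"},
        "enzyme_B": {5: "wt_like", 6: "functional"},
    }
    print(tally_active_sites(sites, preds))
```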

[Figure panels: residue classifications computed with monomer, dimer or tetramer ΔΔGs; one panel is titled "Hemoglobin subunit β (P68871)".]

Supplementary Figure 10. Additional examples of the effect of input structure choice on residue classification in oligomeric proteins. Panels A and B show differences in the classification of residues in orotate phosphoribosyltransferase when we use either (A) the monomer or (B) the dimer structure as input to the Rosetta ΔΔG calculations. Residues at the interface are shown in van der Waals atomic representation, and residues involved in forming the active site at the dimer interface are labelled. Panel C shows, as in panel A, a comparison of predictions for human myoglobin and the α and β subunits of human hemoglobin. For human hemoglobin, the left column shows the residue classification using ΔΔG values from the monomer, while the right column shows the classification made with ΔΔG values calculated while keeping the entire tetrameric structure. Residues at the tetramer interface are shown in van der Waals atomic representation in all the hemoglobin panels; residues at the corresponding positions in myoglobin are highlighted to make comparisons easier.